Collecting Reliable Human Judgements on Machine-Generated Language: The Case of the QG-STEC Data

Authors

  • Keith Godwin
  • Paul Piwek
Abstract

Question generation (QG) is the problem of automatically generating questions from inputs such as declarative sentences. The Question Generation Shared Task Evaluation Challenge (QG-STEC) Task B, which took place in 2010, evaluated several state-of-the-art QG systems. However, analysis of the evaluation results was affected by low inter-rater reliability. We adapted Nonaka & Takeuchi's knowledge creation cycle to the task of improving the evaluation annotation guidelines, with a preliminary test showing clearly improved inter-rater reliability.
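
Inter-rater reliability is the quantity at stake in both the reported problem and the reported improvement. As a minimal sketch of how such agreement can be quantified (the rating scale and data below are hypothetical, and the abstract does not state that Cohen's kappa was the coefficient used in QG-STEC), the following Python computes Cohen's kappa for two raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items with nominal labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label distribution.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical question-quality ratings (e.g. 1 = best, 4 = worst) from two annotators.
ratings_1 = [1, 2, 2, 3, 1, 4, 2, 1]
ratings_2 = [1, 2, 3, 3, 1, 4, 2, 2]
print(round(cohens_kappa(ratings_1, ratings_2), 3))  # 0.652
```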

Similar resources

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are at the core of Machine Translation (MT) engine development, since engines are refined through frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from the Lexical Similarity set on machine tra...

Machine Learning and Citizen Science: Opportunities and Challenges of Human-Computer Interaction

Background and Aim: When processing big data, scientists face the tedious task of analyzing huge volumes of data. Machine learning techniques are a potential solution to this problem. In citizen science, human and artificial intelligence may be combined to facilitate this effort. Considering the ambiguities in machine performance and the management of user-generated data, this paper aims to...

A New Evaluation Approach for Sign Language Machine Translation

This paper proposes a new evaluation approach for sign language machine translation (SLMT). It aims to show a better correlation between its automatically generated scores and human judgements of translation accuracy. To demonstrate the correlation, an Arabic Sign Language (ArSL) corpus was used for the evaluation experiments, and results were obtained with various methods.
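
A minimal sketch of the correlation step such studies rely on, using Pearson's r between automatic metric scores and human judgements (the scores and ratings below are invented, and Pearson's r is only one of several coefficients used in practice):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between automatic metric scores and human judgements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-sentence data: automatic metric scores vs. mean human adequacy ratings.
metric_scores = [0.31, 0.55, 0.42, 0.78, 0.66]
human_ratings = [2.0, 3.5, 2.5, 4.5, 4.0]
print(round(pearson_r(metric_scores, human_ratings), 3))
```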

Comparing Automatic Evaluation Measures for Image Description

Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; nevertheless, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine the correlation of automatic measures w...
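
Since the comparison centres on unigram BLEU, a minimal sketch of that measure may help: clipped unigram precision with a brevity penalty, against a single reference (real evaluations usually use multiple references and a toolkit implementation; the example sentences below are invented):

```python
from collections import Counter
from math import exp

def unigram_bleu(candidate, reference):
    """Unigram BLEU for one candidate against one reference (both token lists)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clipped unigram precision: each candidate token counts at most as often
    # as it appears in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else exp(1 - len(reference) / len(candidate))
    return bp * precision

candidate = "a dog is running on the grass".split()
reference = "a dog runs across the grass".split()
print(round(unigram_bleu(candidate, reference), 3))
```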

An Intelligent and Semantics-Oriented System for Evaluating Text Summarization Systems

Nowadays, summarizers and machine translators have attracted much attention, and much work on building such tools has been done around the world. As for other languages, there have been efforts in this field for Farsi, so evaluating such tools is of great importance. Human evaluations of machine summarization are extensive but expensive. Human evaluations can take months to f...

Journal:

Volume   Issue

Pages  -

Publication date: 2016